\chapter{Classification based on malware's meta-data using decision tree approach}\label{chap:4}
%EXPERIMENTAL RESULTS AND ANALYSIS
%
%
This chapter describes Win32 executive files structure, known as PE files structure, which is used in classification system as malware meta-data. Afterwards, the machine learning technique called decision tree method is mentioned in classification malware system. Finally, we will propose the classification approach based on malware's meta-data using decision tree technique. 

\section{PE file format\cite{peheaderci}}
\subsection{The PE File Format}

The Portable Executable (PE) file format has been designed to be used by all Win32 based system. The general layout of a PE file is given in Figure \ref{fig:pefiles} . PE header consists of four parts: a MS-DOS session including DosMZ header and Dos stub; PE header; a section table; and images pages consisting of several different regions such as .text session, and .data session.
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{graph/pefiles.png}
\caption{The general layout of a PE file.}
\label{fig:pefiles}
\end{figure}

The PE file format starts with DOS MZ header. The first two bytes - "MZ" of the header is the DOS EXE signature. The first few hundred bytes of the typical PE file are taken up by the MS-DOS stub. This stub is a tiny program that displays a string as "This program cannot be run in MS-DOS mode.". 

The following part is PE header. The fields of this structure contain only the most basic information about the file such as the locations and sizes of the code and the data areas. These data structures of PE header will be explained shortly.

Following the PE header is the session table. The section table is an array of IMAGE\_SECTION\_HEADERs structures. An IMAGE\_SECTION\_HEADER provides information about its associated section such as location, length, and characteristics. The number of elements in this array is given in the PE header, known as \emph{NumberOfSections} field. There are commonly at least two sections in a PE file: code and data. Figure \ref{fig:pefiles1} shows a section table for a typical PE file.

\begin{figure}[h!]
\centering
\includegraphics[width=0.8\textwidth]{graph/dosHeaderStructure.png}
\caption{A section table for a typical PE file.}
\label{fig:pefiles1}
\end{figure}

That is all about the physical layout of the PE file format. The major steps in loading a PE file into memory is described as follows:

\begin{itemize}
\item When the PE file is running, the PE loader examines the DOS MZ header for the offset of the PE header. If it is found, it would skip to the PE header.
\item The PE loader checks if the PE header is valid. Then, it goes to the end of the PE header if the PE header is valid.
\item Following the PE header is the section table. The PE header reads information about the sections and maps those sections into memory using the file mapping.
\item After the PE file is mapped into memory, the PE loader concerns itself with the logical parts of the PE file, such as the import table.
\end{itemize}
\begin{figure}[h!]
\centering
\includegraphics[width=1\textwidth]{graph/pe1.png}
\caption{PE file format.}
\label{fig:pe1}
\end{figure}


\subsection{The PE Header}


\begin{figure}[httb]
\centering
\includegraphics[width=1\textwidth]{graph/peheader1.jpg}
\caption{Layout a file in the PE header format.}
\label{fig:peheader}
\end{figure}
The PE header is the second part of the PE files structure. The PE header starts with two characters "PE" known the PE header file signature. The main PE header is a structure of type IMAGE\_NT\_HEADERS, which is defined in WINNT.H. The structure of PE header consists of a DWORD and two substructures composed of FileHeader and OptionalHeader. IMAGE\_NT\_HEADERS contains the following three fields:
\begin{verbatim}
DWORD Signature;
IMAGE_FILE_HEADER FileHeader;
IMAGE_OPTIONAL_HEADER OptionalHeader;
\end{verbatim}
Following the PE signature DWORD in the PE header is a structure of type IMAGE\_FILE\_HEADER. 
\begin{verbatim}
typedef struct _IMAGE_FILE_HEADER {
  WORD  Machine;
  WORD  NumberOfSections;
  DWORD TimeDateStamp;
  DWORD PointerToSymbolTable;
  DWORD NumberOfSymbols;
  WORD  SizeOfOptionalHeader;
  WORD  Characteristics;
} IMAGE_FILE_HEADER, *PIMAGE_FILE_HEADER;
\end{verbatim}
The fields of this structure contain only the most basic information about the file. The fields of the IMAGE\_FILE\_HEADER include:
\begin{itemize}
\item Machine: The CPU that this file is intended for.
\item NumberOfSections: The number of the members in section
\item TimeDateStamp: The time that the linker
\item PointerToSymbolTable: The file offset of the COFF symbol table. 
\item NumberOfSymbols: The number of symbols in the COFF symbol table.
\item SizeOfOptionalHeader: The size of an optional header that can follow this structure.
\item Characteristics: The size of an optional header can follow this structure. If there are no relocations in this file, characteristics will be 0x0001. Otherwise, the file is an executable image, characteristics will be 0x0002.  If the file is a dynamic-link library, not a program, characteristics will be 0x2000.
\end{itemize}

Following the file header is the optional header which consists of three parts: COFF standard fields, Windows specific fields, and data directories. The first part includes the COFF standard field which contains the sizes of various parts of code, and the \emph{AddressOfEntryPoint} which indicates the location of the entry point of the application and the location of the end of the Import Address Table(IAT). Then, the second part contains the information of OS. Finally, the Data directories field contains the information of Session table.

The following is the structure of optional header. 
\begin{verbatim}
typedef struct _IMAGE_OPTIONAL_HEADER {
  WORD                 Magic;
  BYTE                 MajorLinkerVersion;
  BYTE                 MinorLinkerVersion;
  DWORD                SizeOfCode;
  DWORD                SizeOfInitializedData;
  DWORD                SizeOfUninitializedData;
  DWORD                AddressOfEntryPoint;
  DWORD                BaseOfCode;
  DWORD                BaseOfData;
  DWORD                ImageBase;
  DWORD                SectionAlignment;
  DWORD                FileAlignment;
  WORD                 MajorOperatingSystemVersion;
  WORD                 MinorOperatingSystemVersion;
  WORD                 MajorImageVersion;
  WORD                 MinorImageVersion;
  WORD                 MajorSubsystemVersion;
  WORD                 MinorSubsystemVersion;
  DWORD                Win32VersionValue;
  DWORD                SizeOfImage;
  DWORD                SizeOfHeaders;
  DWORD                CheckSum;
  WORD                 Subsystem;
  WORD                 DllCharacteristics;
  DWORD                SizeOfStackReserve;
  DWORD                SizeOfStackCommit;
  DWORD                SizeOfHeapReserve;
  DWORD                SizeOfHeapCommit;
  DWORD                LoaderFlags;
  DWORD                NumberOfRvaAndSizes;
  IMAGE_DATA_DIRECTORY DataDirectory[IMAGE_NUMBEROF_DIRECTORY_ENTRIES];
} IMAGE_OPTIONAL_HEADER, *PIMAGE_OPTIONAL_HEADER;
\end{verbatim}

The fields of Optional header are described as follow: 
\begin{itemize}
\item Magic: appears to be set to 0x010B.
\item MajorLinkerVersion: "The major version number of the linker."
\item MinorLinkerVersion: "The minor version number of the linker."
\item SizeOfCode: "The size of the code section, in bytes, or the sum of all such sections if there are multiple code sections."
\item SizeOfInitializedData: "The size of the initialized data section, in bytes, or the sum of all such sections if there are multiple initialized data sections."
\item SizeOfUninitializedData: "The size of the uninitialized data section, in bytes, or the sum of all such sections if there are multiple uninitialized data sections."
\item AddressOfEntryPoint: "A pointer to the entry point function, relative to the image base address. For executable files, this is the starting address. For device drivers, this is the address of the initialization function. The entry point function is optional for DLLs. When no entry point is present, this member is zero."
\item BaseOfCode: "A pointer to the beginning of the code section, relative to the image base."
\item BaseOfData: "A pointer to the beginning of the data section, relative to the image base."
\item ImageBase: "The preferred address of the first byte of the image when it is loaded in memory."
\item SectionAlignment: "The alignment of sections loaded in memory, in bytes."
\item FileAlignment: "The alignment of the raw data of sections in the image file, in bytes."
\item MajorOperatingSystemVersion: "The major version number of the required operating system."
\item MinorOperatingSystemVersion: "The minor version number of the required operating system."
\item MajorImageVersion: "The major version number of the image."
\item MinorImageVersion: "The minor version number of the image."
\item MajorSubsystemVersion: "The major version number of the subsystem."
\item MinorSubsystemVersion: "The minor version number of the subsystem."
\item Reserved1: "This member is reserved and must be 0."
\item SizeOfImage: "The size of the image, in bytes, including all headers."
\item SizeOfHeaders: "The combined size of the MS-DOS stub, the PE header, and the section headers, rounded to a multiple of the value specified in the FileAlignment member."
\item CheckSum: "The image file checksum."
\item Subsystem: "The subsystem required to run this image."
\item DllCharacteristics: "The DLL characteristics of the image. "
\item SizeOfStackReserve:"The number of bytes to reserve for the stack."
\item SizeOfStackCommit: "The number of bytes to commit for the stack."
\item SizeOfHeapReserve: "The number of bytes to reserve for the local heap."
\item SizeOfHeapCommit: "The number of bytes to commit for the local heap."
\item LoaderFlags: "This member is obsolete."
\item NumberOfRvaAndSizes: "The number of directory entries in the remainder of the optional header. Each entry describes a location and size."
\item DataDirectory: "A pointer to the first IMAGE\_DATA\_DIRECTORY structure in the data directory."
\end{itemize}

\section{Decision tree\cite{decisiontree}}

Decision tree is a machine learning technique for building knowledge-based systems by inductive inference from examples. Decision tree is to create a model that predicts the value of a target variable based on several input variables. An example is shown on the Figure \ref{fig:decisiontreesample} from the data Table \ref{table:decisiontreesampletable}.Each interior node corresponds to one of the input variables. There are edges to children for each of the possible values of that input variable. Each leaf shows a value of the target variable.
\begin{table}
  \begin{center}
    \begin{tabular}{ | l | l | l | l | l }
     \hline
    Condition & Class & Attribute 1 &  Attribute \\ \hline
    1 & A & 0 & 0 & ... \\ 
	2 & B & 1 & 0 & ... \\ 
	3 & C & 1 & 1 & ... \\ 
	... & ... & ... & ... & ... \\ 


    \end{tabular}
	\end{center}
     \caption{The data table.}
    \label{table:decisiontreesampletable}
\end{table}

\begin{figure}[httb]
\centering
\includegraphics[width=1\textwidth]{graph/decisiontreesample.png}
\caption{Decision tree sample.}
\label{fig:decisiontreesample}
\end{figure}

In data mining, the decision tree technique can be described also as the combination of mathematical and computational techniques to aid the description, categorization and generalization of a given set of data.

Data comes in records of the form
\begin{equation}
    (x,Y) = (x_1, x_2, x_3, ..., x_k, Y) 
\end{equation}

The dependent variable, $Y$ which is the target variable is analysed to understand, classify or generalise. The vector x is composed of the input variables, $x_{1}, x_{2}, x_{3}$ that are used for that task.
\section{Classification based on malware's meta-data using decision tree approach}

To make malware classification rapid and correct, decision tree algorithm is used in malware classification system.

As the previous section, data comes in records of the form:
\begin{equation}
    (x,Y) = (x_1, x_2, x_3, ..., x_k, Y) 
\end{equation}

In malware classification system, Y is malware families name, $x_{1}, x_{2}$,...,$x_{3}$ is malware file's meta-data. Malware name is provided by anti-virus engines that presented in chapter \ref{chap:2} cannot used in this system. The new malware families is required to detect meaningful characterization of malware. This classification system use famous malware families reported in Information technology agency such as netsky, mydoom, autorun, mytob, and other famous malware families. Therefore, when new malware is classified into famous malware families, researcher detect some meaningful characterize of malware, and reduce time for malware analysis. 
\section{Conclusion}
In this chapter, our approach is proposed with the technique used in malware classification system.
In the next chapter, propose the system design and implementation will be proposed.